Abstract

This paper focuses on numerical function generators (NFGs) based on
k-th order polynomial approximations. We show that increasing the
polynomial order k significantly reduces the NFG's memory size.
However, a larger k requires more logic elements and multipliers. To
quantify this tradeoff, we introduce the FPGA utilization measure, and
then determine the optimum polynomial order k. Experimental results
show that: 1) for low accuracies (up to 17 bits), 1st-order polynomial
approximations produce the most efficient implementations; and 2) for
higher accuracies (18 to 24 bits), 2nd-order polynomial approximations
produce the most efficient implementations.


1.  Introduction 

With the introduction of FPGAs, it is possible to put, on one chip,
large logic systems, including general purpose microprocessors and
special system-on-a-chip designs. In spite of a large amount of
available hardware, designers are often limited in their designs
because a specific FPGA resource is scarce. That is, FPGAs consist of
logic modules, multiplexers, adders, multipliers, and memory blocks.
An application requiring many arithmetic modules, for example, may
exhaust the adders and multipliers before exhausting memory modules.
Therefore, the success of a design depends on achieving a balance in
the use of the various resources [17]. In this paper, we show a design
of a numerical function generator (NFG) that adapts to the FPGA's
resources: logic, arithmetic units, and memory.

Numerical functions f(x), such as trigonometric, logarithmic, square
root, reciprocal, and combinations of these functions, are extensively
used in computer graphics, digital signal processing, communication
systems, robotics, astrophysics, fluid physics, etc. To compute
elementary functions, iterative algorithms, such as the CORDIC
(Coordinate Rotation Digital Computer) algorithm [1, 30], have often
been used. Although the CORDIC algorithm achieves accuracy with compact
hardware, its computation time is proportional to the number of bits
used to represent the number. For a function composed of elementary
functions, the CORDIC algorithm is slower, since it computes each
elementary function sequentially. It is too slow for numerically
intensive applications. Implementation by a single lookup table for
f(x) is simple and very fast. For low-precision computations of f(x)
(e.g. x and f(x) have 8 bits), this implementation is straightforward.
For high-precision computations, however, the single lookup table
implementation is impractical due to the huge table size.
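
To make the table-size argument concrete, the following minimal sketch
computes the number of bits held by a single lookup table. It assumes,
for illustration only, that an n-bit input addresses a table whose words
are also n bits wide; the function name single_lut_bits is hypothetical.

```python
def single_lut_bits(n_in: int, n_out: int) -> int:
    """Bits stored by one lookup table with n_in address bits and n_out-bit words."""
    return (1 << n_in) * n_out


for n in (8, 16, 24):
    # 8 bits -> 2,048 bits; 16 bits -> 1,048,576 bits; 24 bits -> 402,653,184 bits
    print(f"{n}-bit precision: {single_lut_bits(n, n):,} bits")
```

Under this assumption the table grows by a factor of roughly 2^16 between
8-bit and 24-bit precision, which is why a single table is practical only
at low precision.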

To reduce memory size, polynomial approximations have been used [3, 5,
6, 7, 12, 14, 22, 27, 28, 29]. These methods approximate the given
numerical functions by piecewise polynomials, and realize the
polynomials with hardware. For piecewise polynomial approximations, in
many cases, the domain is partitioned into uniform segments. For
elementary functions, such as sin(x) and e^x, by using higher-order
polynomial approximations, the number of uniform segments can be
reduced, and therefore the memory size can be reduced. However, for
some numerical functions, such as √(-ln(x)), methods based on uniform
segmentation yield a memory size that is too large for implementation
on conventional FPGAs even if second-order polynomials are used [21].
On the other hand, since our NFG is based on non-uniform segmentation,
for a wide range of functions, the memory size can be reduced by using
second-order polynomials [21]. However, although second-order
polynomial approximations reduce memory size, more multipliers and
adders are required. To produce the most efficient FPGA implementation
of NFGs, this paper introduces a measure, the FPGA utilization measure,
and finds the polynomial order k that produces the FPGA implementations
with the smallest FPGA utilization measure.

This paper focuses on the implementation of table lookup NFGs. Fig. 1
shows the synthesis flow of the design process, which begins with a
Design Specification written in Scilab [26], a MATLAB-like software
application, and ends with HDL code. The Design Specification consists
of a function f(x), a domain over x, an accuracy, and an order k of the
approximation polynomial. This system first partitions the domain into
segments, and then approximates f(x) by a polynomial function for each
segment. Next, it analyzes the errors, and derives the necessary
precision for computing units in the NFG. Then, it generates HDL code
to be mapped into an FPGA using FPGA vendor-supplied design software.
The significance of the design flow is that it provides the context of
the implementations shown here.


2.  Preliminaries 

Definition 1 A binary fixed-point representation has the form
d_{l-1} d_{l-2} ... d_0 . d_{-1} ... d_{-m}, where d_i ∈ {0,1}
(-m ≤ i ≤ l-1), l is the number of bits for the integer part, and m is
the number of bits for the fractional part. This representation is
two's complement. In this paper, we use l = 1.

Definition 2 Error is the absolute difference between the exact value
and the value produced by the hardware. Approximation error is the
error caused by a function approximation. Rounding error is the error
caused by a binary fixed-point representation. It results from either
truncation or rounding, whichever is applied; in both cases the
resulting error is called rounding error. Acceptable error is the
maximum error that an NFG may assume. Acceptable approximation error is
the maximum approximation error that a function approximation may
assume.

Definition 3 Precision is the total number of bits for a binary
fixed-point representation. Specifically, n-bit precision specifies
that n bits are used to represent the number. In this paper, we assume
that an n-bit precision NFG has an n-bit input and an acceptable error
of 2^{-m}, where m = n - 1.
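
As an illustration of Definitions 1 to 3, the sketch below quantizes a
real value into an (l + m)-bit two's complement word and converts it
back. It is a software model only; the helper names to_fixed and
from_fixed are hypothetical, and truncation (rounding toward minus
infinity) is assumed as the rounding rule.

```python
import math


def to_fixed(value: float, m: int, l: int = 1) -> int:
    """Encode value as an (l+m)-bit two's complement word with m fractional bits."""
    scaled = math.floor(value * (1 << m))          # truncation toward -infinity
    lo, hi = -(1 << (l + m - 1)), (1 << (l + m - 1)) - 1
    if not lo <= scaled <= hi:
        raise OverflowError("value is outside the representable range")
    return scaled & ((1 << (l + m)) - 1)           # wrap negatives into n bits


def from_fixed(word: int, m: int, l: int = 1) -> float:
    """Decode an (l+m)-bit two's complement word back to a real value."""
    n = l + m
    if word >= 1 << (n - 1):                       # sign bit set
        word -= 1 << n
    return word / (1 << m)


# 8-bit precision: n = 8, l = 1, m = 7, so the rounding error is below 2**-7.
w = to_fixed(0.3, m=7)
print(w, from_fixed(w, m=7))                       # 38 0.296875
```

With l = 1 the single integer bit acts as the sign bit, so the
representable range in this model is [-1, 1 - 2^{-m}].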


Input:  Numerical function f(x), domain [a, b] for x,
        acceptable approximation error ε_a, and
        polynomial order k.

Output: Segments [s_0, e_0], [s_1, e_1], ..., [s_{t-1}, e_{t-1}].

Process:

1. Let s_0 = a and i = 0.

2. Find a value p (> s_i) where ε_k(s_i, p) = ε_a.

3. If p > b, then let p = b.

4. Let e_i = p and i = i + 1.

5. If p = b, then let t = i, and stop.

6. Else, let s_i = p, and go to step 2.


Fig. 2. Nonuniform segmentation algorithm for the domain.


3. Piecewise Polynomial Approximation

To approximate the numerical function f(x) using polynomial functions,
we first partition the domain for x into segments. For each segment, we
approximate f(x) using a polynomial function
g(x) = c_k x^k + c_{k-1} x^{k-1} + ... + c_0. In this case, we seek the
fewest segments, since this reduces the memory size needed for storing
the coefficients of the polynomial functions.

Piecewise polynomial approximations have been applied to a domain that
has been partitioned into uniform segments [3, 5, 6, 7, 22, 27, 28,
29]. Such methods are simple and fast, but for some kinds of numerical
functions, too many segments are required, resulting in large memory.
Further, for such functions, methods based on uniform segmentation
cannot always reduce the memory size, even if higher-order polynomials
are used. For example, the reduction in the number of segments may not
be sufficient to compensate for the increase in word width due to an
increase in the number of stored coefficients needed for the
higher-order polynomials.

For a given error, non-uniform segmentation of the domain uses fewer
segments than uniform segmentation [12, 13, 25]. To reduce the memory
size for a wide range of functions as the polynomial order increases,
we use non-uniform segmentation.

3.1. Non-uniform Segmentation Algorithm

The number of non-uniform segments depends on the approximation
polynomial. Specifically, fewer segments are required when the
approximation polynomial is more accurate. In this paper, we use
kth-order Chebyshev polynomials to approximate f(x).

For a segment [s, e] of f(x), the maximum approximation error
ε_k(s, e) of the kth-order Chebyshev approximation [16] is given by

    \varepsilon_k(s,e) = \frac{2\,(e-s)^{k+1}}{4^{k+1}\,(k+1)!} \max_{s \le x \le e} \left| f^{(k+1)}(x) \right|,    (1)


where f^{(k+1)} is the (k+1)th-order derivative of f. From (1),
ε_k(s, e) is a monotone increasing function of segment width e - s.
From this property, it follows that a greedy algorithm in which each
segment width is maximized for the given approximation error produces
the optimum segmentation. Fig. 2 shows the (nonuniform) segmentation
algorithm. The inputs for this algorithm are a numerical function f(x),
a domain [a, b] for x, an acceptable approximation error ε_a, and a
polynomial order k. Then, this algorithm approximates f(x) with
acceptable approximation error ε_a, and produces t segments
[s_0, e_0], [s_1, e_1], ..., [s_{t-1}, e_{t-1}]. For step 2 in Fig. 2,
the accurate computation of the value p, where ε_k(s_i, p) = ε_a, is
difficult. We could obtain the maximum value p' satisfying
ε_k(s_i, p') ≤ ε_a by scanning values of the n-bit input x. However,
this has time complexity O(2^n). Therefore, we compute the maximum
value p' by setting the bits of p' to 0 or 1 from the most significant
to the least significant bit such that ε_k(s_i, p') ≤ ε_a, as in a
binary search. This has time complexity O(n). In the computation of
ε_k(s_i, p'), the value of max_{s_i ≤ x ≤ p'} |f^{(k+1)}(x)| is
computed by nonlinear programming [10].
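
The greedy procedure of Fig. 2, together with the bit-by-bit search for
p', can be sketched as follows. This is an illustrative software model,
not the synthesis tool itself. It makes three assumptions: endpoints are
non-negative integers on a grid of step 2^{-m}; the caller supplies the
bound on |f^{(k+1)}| through the hypothetical argument max_abs_deriv
(the paper obtains it by nonlinear programming); and the example
function e^x with ε_a = 2^{-16} is chosen here only for demonstration.

```python
import math
from typing import Callable, List, Tuple


def cheb_error_bound(s: float, e: float, k: int,
                     max_abs_deriv: Callable[[float, float], float]) -> float:
    """Error bound (1) of a kth-order Chebyshev approximation on [s, e]."""
    width = e - s
    return (2.0 * width ** (k + 1)
            / (4.0 ** (k + 1) * math.factorial(k + 1))
            * max_abs_deriv(s, e))


def segment_domain(a: int, b: int, m: int, k: int, eps_a: float,
                   max_abs_deriv: Callable[[float, float], float]
                   ) -> List[Tuple[int, int]]:
    """Greedy segmentation of Fig. 2; a and b are in units of 2**-m."""
    step = 2.0 ** -m
    nbits = max(b.bit_length(), 1)
    segments: List[Tuple[int, int]] = []
    s = a
    while True:
        p = 0
        # Decide the bits of p from MSB to LSB: keep a bit if the candidate
        # stays inside the domain and the error bound is still acceptable.
        for bit in reversed(range(nbits)):
            cand = p | (1 << bit)
            if cand > b:
                continue
            if cand <= s or cheb_error_bound(s * step, cand * step, k,
                                             max_abs_deriv) <= eps_a:
                p = cand
        if p <= s:
            raise ValueError("eps_a is too small for this input grid")
        segments.append((s, p))
        if p == b:
            return segments
        s = p


# Example: f(x) = exp(x) on [0, 1), where |f^(k+1)(x)| <= exp(e) on [s, e].
exp_deriv_max = lambda s, e: math.exp(e)
m = 15
b = (1 << m) - 1
segs_k1 = segment_domain(0, b, m, 1, 2.0 ** -16, exp_deriv_max)
segs_k2 = segment_domain(0, b, m, 2, 2.0 ** -16, exp_deriv_max)
print(len(segs_k1), len(segs_k2))   # the 2nd-order run needs fewer segments
```

Each outer iteration evaluates the error bound at most once per input
bit, which corresponds to the O(n) search described above.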

3.2.  Computation  of  Approximate  Value 

For each segment [s_i, e_i], f(x) is approximated by the corresponding
polynomial function g(x, i). That is, the approximated value y of f(x)
is computed as y = g(x, i) = c_k(i) x^k + c_{k-1}(i) x^{k-1} + ... +
c_0(i), where the coefficients c_k(i), c_{k-1}(i), ..., c_0(i) are
derived from the kth-order Chebyshev approximation polynomial [16].
Substituting (x - q_i) + q_i for x yields the transformation

    g(x,i) = c_k(i)\,(x - q_i)^k + c'_{k-1}(i)\,(x - q_i)^{k-1} + \cdots + c'_0(i),    (2)

where

    c'_j(i) = \sum_{l=0}^{k-j} \binom{l+j}{j}\, c_{l+j}(i)\, q_i^{\,l}    (j = 0, 1, \ldots, k-1).

This transformation reduces the multiplier size (see Section 4.2).
Instead of computing g(x, i) in the form (2), we apply Horner's method
[18] to derive (3) below:

    g(x,i) = \bigl( \cdots \bigl( c_k(i)\,(x - q_i) + c'_{k-1}(i) \bigr)(x - q_i) + \cdots \bigr)(x - q_i) + c'_0(i).    (3)

This reduces the number of multipliers from approximately k^2/2 to k.
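
The transformation to powers of (x - q_i) and the Horner evaluation (3)
can be checked numerically with the short sketch below. The function
names are hypothetical, floating-point arithmetic stands in for the
fixed-point datapath, and the coefficient values are arbitrary test
data rather than Chebyshev coefficients.

```python
from math import comb
from typing import List, Sequence


def shift_coefficients(c: Sequence[float], q: float) -> List[float]:
    """Return c' with sum_j c[j] x^j == sum_j c'[j] (x - q)^j (c[j] multiplies x^j)."""
    k = len(c) - 1
    return [sum(comb(l + j, j) * c[l + j] * q ** l for l in range(k - j + 1))
            for j in range(k + 1)]


def horner_shifted(c_shift: Sequence[float], x: float, q: float) -> float:
    """Evaluate form (3): k multiplications and k additions."""
    d = x - q
    acc = c_shift[-1]                      # leading coefficient c_k
    for coeff in reversed(c_shift[:-1]):
        acc = acc * d + coeff
    return acc


c = [1.0, -2.0, 0.5, 3.0]                  # arbitrary 3rd-order polynomial
q, x = 0.25, 0.7
direct = sum(cj * x ** j for j, cj in enumerate(c))
print(direct, horner_shifted(shift_coefficients(c, q), x, q))
```

The leading coefficient is unchanged by the shift, which is why (2) and
(3) use c_k(i) without a prime.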

4.  Architecture  for  NFGs 

Fig.  3  shows  the  architecture  realizing  (3).  It  consists  of 
the  segment  index  encoder,  the  coefficients  table,  multipli¬ 
ers,  and  adders. 

4.1.  Segment  Index  Encoder 

A segment index encoder converts an input x into a segment index i of
the corresponding segment [s_i, e_i]. It realizes the segment index
function seg-func(x) : {0,1}^n → {0, 1, ..., t-1} shown in Fig. 4 (a),
where x has n bits, and t denotes the number of segments. NFGs based on
uniform segmentation, in which the most significant bits directly drive
the address inputs of a coefficients table, need no segment index
encoder. On the other hand, NFGs based on non-uniform segmentation
require this additional circuit. Potentially, this is a complex
circuit.

Fig. 3. Architecture for NFGs (input x, output y).

Fig. 4. Segment index encoder: (a) segment index function; (b) LUT cascade.
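
Functionally, the segment index function is a search over the segment
boundaries. The sketch below is a software analogue only and says
nothing about the hardware realization. It assumes that the segment
start points are given as a sorted list of integers on the input grid
(the hypothetical argument starts), and that a point shared by two
adjacent segments is assigned to the later one.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def make_seg_func(starts: Sequence[int]) -> Callable[[int], int]:
    """Build seg_func(x): index i of the segment whose start s_i is the largest s_i <= x."""
    def seg_func(x: int) -> int:
        i = bisect_right(starts, x) - 1
        if i < 0:
            raise ValueError("x lies below the first segment")
        return i
    return seg_func


seg_func = make_seg_func([0, 100, 250, 400])   # four non-uniform segments
print([seg_func(x) for x in (0, 99, 100, 399, 1000)])   # [0, 0, 1, 2, 3]
```

For uniform segmentation the same mapping reduces to taking the most
significant bits of x, which is why no encoder is needed in that case.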

To simplify the segment index encoder, a special non-uniform
segmentation [12, 13] has been proposed. This method produces a simple
circuit by restricting the segmentation points, and results in fewer
segments, as well as faster and more compact NFGs than produced by
uniform segmentation. In this method, the user has to select a
segmentation appropriate to the given function. Thus, it is difficult
for non-experts to obtain optimum segmentation for the given function.

For a fast and compact realization of any non-uniform segmentation, we
use an LUT cascade [11, 23] shown in Fig. 4 (b). By using an LUT
cascade, for the given function, we can use the optimum non-uniform
segmentation generated by the algorithm of Fig. 2. To obtain the LUT
cascade, we consider seg-func(x) as a multiple-output logic function,
and represent the logic function using a binary decision diagram (BDD)
[2, 4]. By functional decompositions using the BDD, we obtain the LUT
cascade. To produce compact NFGs for a wide range of functions, it is
important to guarantee that the size of the LUT cascade is reasonable
for any non-uniform segmentation. In [24], we have shown that the size
of an LUT cascade depends on the number of segments, and by using an
LUT cascade, we can generate compact NFGs for a wide range of
functions. In this paper, we reduce the number of non-uniform segments
using a high-order polynomial approximation, and thereby we
significantly reduce the memory sizes of the coefficients table and the
LUT cascade. Therefore, our NFGs can be implemented using remaining
hardware resources in an FPGA. To the best of our knowledge, this paper
presents the first NFG that uses kth-order approximating functions in
optimum non-uniform segments, for k > 2.

4.2.  Size  Reduction  of  Multiplier 

The number of bits representing x - q_i determines the sizes of all
the multipliers. Therefore, to reduce multiplier size, we reduce the
number of bits representing x - q_i. Reducing the value of x - q_i
reduces not only the sizes of the multipliers, but also the error [8].
From (2), we can choose any value for q_i. To reduce the value of
x - q_i, for a segment [s_i, e_i], we set q_i = (s_i + e_i)/2. Then, we
have |x - q_i| ≤ (e_i - s_i)/2. Thus, reducing the segment width
e_i - s_i reduces the value for x - q_i. However, this also increases
the number of segments, and results in increased memory size. We show a
reduction method of segment width that does not increase memory size.

Assume that the coefficients table in Fig. 3 has 2^u words, where
u = ⌈log_2 t⌉, and t is the number of segments. Therefore, we can
increase the number of segments up to t = 2^u without increasing the
memory size. The size of an LUT cascade also depends on the value of u.
However, increasing the number of segments to t = 2^u rarely increases
the size of the LUT cascade. We reduce the size of segments by dividing
larger segments into two equal sized segments, increasing t up to the
next power of 2.
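
One possible reading of this splitting step is sketched below: the
widest remaining segment is halved repeatedly until the segment count
reaches 2^u. The choice of always splitting the widest segment, the
integer grid for endpoints, and the function name are assumptions made
for illustration; the text above does not fix the splitting order.

```python
import heapq
from typing import List, Tuple


def pad_to_power_of_two(segments: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Halve the widest segments until their number is the next power of 2."""
    if not segments:
        return []
    target = 1 << (len(segments) - 1).bit_length()   # 2**ceil(log2(t))
    heap = [(-(e - s), s, e) for s, e in segments]   # max-heap on width
    heapq.heapify(heap)
    while len(heap) < target:
        neg_width, s, e = heapq.heappop(heap)
        if -neg_width < 2:                           # cannot split on this grid
            heapq.heappush(heap, (neg_width, s, e))
            break
        mid = (s + e) // 2                           # halves differ by at most one step
        heapq.heappush(heap, (-(mid - s), s, mid))
        heapq.heappush(heap, (-(e - mid), mid, e))
    return sorted((s, e) for _, s, e in heap)


print(pad_to_power_of_two([(0, 8), (8, 10), (10, 32)]))
# [(0, 8), (8, 10), (10, 21), (21, 32)]
```

Because the coefficients table already has 2^u words, the extra
segments occupy entries that would otherwise be unused, while the
largest |x - q_i| becomes smaller.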

5.  FPGA  Utilization  Measure 

Modern FPGAs consist of various components, such as logic elements,
memory blocks, and embedded multipliers. It is important to use these
hardware resources efficiently. To generate the most efficient NFGs
depending on the available hardware resources in an FPGA, we introduce
the FPGA utilization measure.

Definition 4 Given available hardware resources in an FPGA, the FPGA
utilization measure U is the sum of utilizations for those hardware
resources. In this paper, we assume that in an FPGA, there are four
hardware resources: logic element (LE), embedded multiplier (DSP), and
two types of RAM block (M4K and M512). That is,

    U = \left( \frac{R_{LE}}{A_{LE}} + \frac{R_{DSP}}{A_{DSP}} + \frac{R_{M4K}}{A_{M4K}} + \frac{R_{M512}}{A_{M512}} \right) \times 100\%,

where R_LE and A_LE denote the number of required LEs and available
LEs, respectively. For DSP, M4K, and M512, we use a similar notation.


Table 1. Numbers of uniform and non-uniform segments for √(-ln(x))
and arcsin(x).

Function f(x)   Domain [a,b]   k   Uniform segments   Non-uniform segments (% of uniform)
√(-ln(x))       (0,1)          1   8,388,607          8,230 (0.0981%)
                               2   8,388,607            698 (0.0083%)
                               3   8,388,607            213 (0.0025%)
                               4   8,388,607            111 (0.0013%)
                               5   8,388,607             75 (0.0009%)
arcsin(x)       [0,1)          1   8,388,608          3,067 (0.0366%)
                               2   8,388,608            256 (0.0031%)
                               3   8,388,608             81 (0.0010%)
                               4   8,388,608             45 (0.0005%)
                               5   4,194,304             31 (0.0007%)


Using this measure, we find a polynomial order that produces the most
efficient NFGs depending on the unused hardware resources in an FPGA.
Specifically, we seek the smallest FPGA utilization measure across the
various orders. Note that we can find a polynomial order that produces
a feasible FPGA implementation by using a large penalty for the measure
(e.g. U = ∞) when a required resource is larger than the available
resource (e.g. A_LE < R_LE). However, our experimental results show
that the smallest FPGA utilization measure results in a feasible and
efficient FPGA implementation even if such a penalty is not used.
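
The measure and the selection of the polynomial order can be sketched
as follows. The availability figures are those reported for the Stratix
EP1S80F1020C5 in Section 6.3; all required-resource numbers are
hypothetical values invented for this example, as are the function
names. A resource with zero availability is treated as infeasible
whenever it is required, which is an assumption of this sketch.

```python
import math
from typing import Dict, Mapping


def fpga_utilization(required: Mapping[str, int], available: Mapping[str, int],
                     penalize_infeasible: bool = True) -> float:
    """FPGA utilization measure U in percent (Definition 4)."""
    total = 0.0
    for name, avail in available.items():
        req = required.get(name, 0)
        if req == 0:
            continue
        if avail == 0 or (penalize_infeasible and req > avail):
            return math.inf                 # penalty U = infinity
        total += req / avail
    return 100.0 * total


def best_order(required_by_order: Mapping[int, Mapping[str, int]],
               available: Mapping[str, int]) -> int:
    """Polynomial order k with the smallest utilization measure."""
    return min(required_by_order,
               key=lambda k: fpga_utilization(required_by_order[k], available))


STRATIX: Dict[str, int] = {"LE": 79040, "DSP": 176, "M4K": 364, "M512": 767}
# Hypothetical requirements, for illustration only.
hypothetical = {
    1: {"LE": 1976, "DSP": 4, "M4K": 182},
    2: {"LE": 3952, "DSP": 8, "M4K": 91},
}
for k, req in hypothetical.items():
    print(k, round(fpga_utilization(req, STRATIX), 2))
print("best order:", best_order(hypothetical, STRATIX))   # best order: 2
```

Scaling the entries of the availability table models the constrained
scenarios in which only part of the device is left for the NFG.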

By using NFGs with the smallest FPGA utilization measure, we can leave
more hardware resources for other modules. This is useful in an
incremental design of modules, such as occurs when a final design is
the result of a sequence of specification changes.

6.  Experimental  Results 

In  this  section,  we  find  the  optimum  polynomial  order  for 
a  given  precision  to  approximate  a  function  using  the  FPGA 
utilization  measure  discussed  in  the  previous  section. 

6.1.  Number  of  Segments  and  Memory  Size 

Table 1 compares the numbers of uniform and non-uniform segments for
24-bit precision NFGs of √(-ln(x)) (0 < x < 1) and arcsin(x)
(0 ≤ x < 1). From this table, we can see that the number of non-uniform
segments significantly decreases as the polynomial order k increases,
while the number of uniform segments does not always decrease. The
number of segments determines the size of the coefficients table. Many
existing methods are based on uniform segmentation. Thus, for these
functions, the existing methods cannot always reduce the memory size
even if the polynomial order increases. On the other hand, our method
can reduce the memory size by increasing the polynomial order for a
wide range of functions.

Fig. 5. Number of non-uniform segments versus precision.

In the following, we conduct experiments using three numerical
functions: cos(πx) (0 < x < 1/2), √x (1/32 < x < 2), and 1/x
(1 < x < 2). The experimental results shown in Figs. 5-13 are averages
for these functions.

Fig. 5 shows the relation between the number of non-uniform segments
and precision. Note the increase in the number of segments with
precision, especially for 5th-order approximations. Further, there is a
significant decrease in the number of segments as the order k of the
polynomial increases. For example, for 24-bit precision, a 5th-order
polynomial yields an approximation that uses only 0.37% of the segments
needed in a 1st-order polynomial.

Fig. 6 shows the relation between total memory size and precision. Our
NFGs require memory in the segment index encoder, as well as in the
coefficients memory. Total memory size is the sum of these two. Fig. 6
shows that the total memory size increases exponentially with the
precision. From Fig. 5 and Fig. 6, we can see that the total memory
size strongly depends on the number of non-uniform segments. Thus, we
can reduce the total memory size by increasing the polynomial order. In
particular, for high precision, increasing the polynomial order reduces
the memory size significantly. For 24-bit precision, the 5th-order
polynomials require only 0.27% of the total memory size needed for the
1st-order polynomials.

However, for low precision, an increase of polynomial order does not
always reduce the total memory size because, while it reduces the
length of the coefficients table (i.e. the number of words of the
coefficients table), it also increases the width of the coefficients
table (i.e. the bit-width of the coefficients table, because more
coefficients are needed in higher-order polynomials). In fact, for an
8-bit precision NFG for √x, the 5th-order polynomial requires more
memory than the 4th-order polynomial (both polynomials require the same
number of segments).

Fig. 6. Total memory size versus precision.

Fig. 7. Number of logic elements versus precision.
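
The length-versus-width tradeoff can be illustrated with a rough model
of the coefficients table alone. The sketch assumes that the table has
2^u words (Section 4.2) and that every word holds k + 1 coefficients of
the same hypothetical bit-width coeff_bits; the real designs use
per-coefficient widths derived from error analysis and also include the
LUT cascade, neither of which is modelled here. The segment counts 8,230
and 698 are taken from Table 1.

```python
def coefficients_table_bits(num_segments: int, k: int, coeff_bits: int) -> int:
    """Approximate size of the coefficients table: 2**u words of (k+1) coefficients."""
    words = 1 << (num_segments - 1).bit_length()   # 2**ceil(log2(t))
    return words * (k + 1) * coeff_bits


# Same number of segments: the higher order only widens each word.
print(coefficients_table_bits(4, 4, 10), coefficients_table_bits(4, 5, 10))   # 200 240

# Many fewer segments: the shorter table outweighs the wider word.
print(coefficients_table_bits(8230, 1, 24), coefficients_table_bits(698, 2, 24))
# 786432 73728
```

In the first case the table grows by the factor (k + 2)/(k + 1); in the
second, the drop in the number of words dominates.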

6.2.  FPGA  Resources 

We implemented fully pipelined NFGs on the Altera Stratix
EP1S80F1020C5 FPGA using the Quartus II ver. 5.0 development tool. We
used the speed optimization option and set the required operating
frequency to 200 MHz.

Fig. 7 shows the relation between the number of logic elements (LEs)
and precision. This graph shows that the number of LEs increases
approximately linearly with the precision. Further, increasing the
polynomial order increases the number of LEs. The 8 to 15-bit precision
5th-order polynomials and 8 to 11-bit precision 4th-order polynomials
require fewer LEs, since they require only one segment for cos(πx) and
therefore no memory address registers for the LUT cascade and the
coefficients table. Note that when the number of segments is one,
coefficients of the polynomial are implemented as constant values, not
memory. Since the 22 to 24-bit precision 1st-order polynomials require
large LUT cascades due to the large number of segments, the number of
pipeline stages for the LUT cascade increases, and therefore the number
of pipeline registers (LEs) increases excessively.

Fig. 8. Number of DSPs versus precision.

Fig. 8 shows the relation between the number of DSPs (9 x 9-bit
multipliers) and precision. This graph shows that the number of DSPs
increases as the precision increases. Further, increasing the
polynomial order increases the number of DSPs. For 24-bit precision,
5th-order polynomials require 20 times more DSPs than needed for
1st-order polynomials.

Fig. 9. Number of RAM blocks versus precision.

Fig. 9 shows the relation between the number of RAM blocks and
precision. The FPGA (EP1S80F1020C5) includes two types of RAM block,
M4K and M512. However, we show only the number of M4Ks because few
M512s were used for the implementations. From Fig. 9, we can see that
the number of RAM blocks increases exponentially with precision.
Further, increasing the polynomial order can reduce the number of RAM
blocks. For 24-bit precision, a 5th-order polynomial requires only 3%
of the RAM blocks required by a 1st-order polynomial.

From these results, we can see that by changing the polynomial order,
we can change the amount of FPGA resources required by NFGs.

6.3. FPGA Utilization Measure

We determine the optimum polynomial order given the precision and the
available hardware resources using the FPGA utilization measure.

Fig. 10. FPGA utilization measure on the Stratix EP1S80F1020C5 versus
precision, where all the resources are available.

Fig. 10 shows the relation between FPGA utilization measure and
precision when all the resources on the Stratix EP1S80F1020C5 are
available for a single NFG. This FPGA consists of 79,040 LEs, 176 DSPs,
364 M4Ks, and 767 M512s. From Fig. 10, we can see that for low
precision (up to 17 bits), the 1st-order polynomials yield the smallest
FPGA utilization measure, and for high precision (18 to 24 bits), the
2nd-order polynomials yield the smallest FPGA utilization measure. We
view this result as very surprising, especially since the number of
segments decreases significantly when the order of the polynomial
increases.

Fig. 11. FPGA utilization measure on the Stratix EP1S80F1020C5, where
10% of LEs and RAM blocks, and 100% of DSPs are available.

Fig. 12. FPGA utilization measure on the Stratix EP1S80F1020C5, where
10% of LEs and DSPs, and 100% of RAM blocks are available.

We assume a situation in which only 10% of LEs, M4Ks, and M512s, and
100% of DSPs are available. Fig. 11 shows the FPGA utilization measure
for this case. From Fig. 11, it follows that, for precisions of 13 bits
or less, 1st-order polynomials yield the smallest FPGA utilization
measure. For 14 to 23-bit precision, 2nd-order polynomials yield the
smallest FPGA utilization measure. And, for 24-bit precision, 3rd-order
polynomials yield the smallest FPGA utilization measure. Note that for
precisions higher than 20 bits, the 1st-order polynomials cannot be
implemented in the FPGA due to insufficient RAM blocks. Fig. 12 shows
the FPGA utilization measure, where only 10% of LEs and DSPs, and 100%
of RAM blocks are available. From Fig. 12, we can see that for up to
23-bit precision, the 1st-order polynomials, and for 24-bit precision,
the 2nd-order polynomials yield the smallest FPGA utilization measure.
Note that the 3rd, 4th, and 5th-order polynomials cannot be implemented
in the FPGA due to insufficient DSPs.

Fig. 13. FPGA utilization measure on the smallest Cyclone II
EP2C5F256C6 versus precision.

In order to understand how a reduction over all resources affects the
realization, we implemented the NFGs on a low-cost FPGA, the Cyclone II
(EP2C5F256C6, the smallest device in the Cyclone II family: 4,608 LEs,
26 DSPs, 26 M4Ks, 0 M512s). Fig. 13 shows results for the Cyclone II.
For high precision, the 1st-order polynomials deplete the RAM blocks in
the Cyclone II. On the other hand, the 4th and 5th-order polynomials
deplete the DSPs. Fig. 13 shows that, for up to 15-bit precision,
1st-order polynomials yield the smallest FPGA utilization measure. On
the other hand, for 16 to 24-bit precision, 2nd-order polynomials yield
the smallest FPGA utilization measure.

These experiments show that there is limited use for 4th and 5th-order
polynomials. However, from Fig. 6, we conjecture that for precisions
higher than 24 bits, 4th and 5th-order polynomials will be useful to
reduce the memory size. Unfortunately, we could not verify this because
of the precision of our NFG synthesis tool, which is written in C.


7.  Conclusion  and  Comments 

We have presented NFGs based on k-th order polynomial approximation
for (k + 1)-times differentiable functions. To generate the most
efficient NFGs, we introduced the FPGA utilization measure.
Experimental results showed that: 1) For some kinds of numerical
functions, the existing methods based on uniform segmentation cannot
always reduce the memory size, even if higher-order polynomials are
used. On the other hand, our method can flexibly change the amount of
hardware resources required by NFGs for a wide range of functions by
changing the polynomial order k. 2) When all hardware resources in an
FPGA can be used for a single NFG, for low accuracies (up to 17 bits),
1st-order polynomials produce the most efficient FPGA implementation.
On the other hand, for high accuracies (18 to 24 bits), 2nd-order
polynomials produce the most efficient FPGA implementation. Even if the
amount of hardware resources is constrained, we can find the optimum
polynomial order for the precision of the NFG and the resource
constraints.


Currently, we are developing a synthesis system that automatically
generates an NFG based on the optimum polynomial order k from a given
amount of hardware resources. In this system, an accurate estimate of
the hardware resources required by an NFG is important.